Skip to content

Conversation

@MarceloRobert
Copy link
Collaborator

@MarceloRobert MarceloRobert commented Oct 15, 2025

Changes

Refactors ingester code:

  • Moves constants to a specific file (also makes constants able to be changed from env vars)
  • Moves some functions to more specific files
  • Adds more typing indicators
  • Lowers the complexity of the most complex functions

How to test

The behavior should be the same as before. So:

  • Start the ingester
    • Create a temporary folder for the submissions as backend/kernelCI_app/management/commands/tmp_submissions (you can change the path, just update the command below)
    • Make sure you have your trees-name.yml file somewhere. It usually is on /backend/volume_data as a result of the treeproof command. If you are feeling lazy and you don't want to run the command, there's an example file here (Google Drive link)
    • Make sure you have a couple of example submission files for testing. You can grab some in this zip file here (Google Drive link)
    • Export the environment variables in your terminal, pointing one of the databases to the local one. Make sure the local one is set as the default one with USE_DASHBOARD_DB=True
    • Now you'll be able to run the command with poetry run python3 ./manage.py monitor_submissions --spool-dir kernelCI_app/management/commands/tmp_submissions --trees-file volume_data/trees-name.yaml (being in the /backend directory, update the paths if necessary)
  • Insert submissions in the folder and check that they are being inserted in the database
    You can also test the ingester on docker, make sure the environment variables are on .env.backend and use docker compose run --rm backend before running the command shown above.

Closes #1571

@MarceloRobert MarceloRobert self-assigned this Oct 15, 2025


def build_instances_from_submission(data: dict[str, Any]) -> dict[TableNames, list]:
def build_instances_from_submission(data: dict[str, Any]) -> ProcessedSubmission:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid using Any? Maybe with generic

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this data is the submission json, so the Any means any value that an item can have: str, int, list, dict, none... IMO in this case it is ok to leave it as Any (and other similar cases around these functions too). This use will at least define that they dict keys are str, otherwise we would be using just dict which just means dict[Any, Any]

STORAGE_BASE_URL = os.environ.get(
"STORAGE_BASE_URL", "https://files-staging.kernelci.org"
)
CONVERT_LOG_EXCERPT = False # If True, convert log_excerpt to output_files url
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not an env as well?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything was defined statically in the file, I just added verbose as an env var. I'll make the new constants file and I can make them all env vars 👍

)
CONVERT_LOG_EXCERPT = False # If True, convert log_excerpt to output_files url

TREES_FILE = "/app/trees.yaml"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you think a new file:

constants/ingester.py?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

Comment on lines 40 to 41
if VERBOSE:
logger.info("Uploading logexcerpt for %s to %s", id, upload_url)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk this VERBOSE variable could be more descriptive

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the description for the VERBOSE variable or the string in this specific logger.info?

str: On success upload: the reference url. On failed upload: the original logexcerpt
"""

upload_url = f"{STORAGE_BASE_URL}/upload"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outside of function

@MarceloRobert MarceloRobert changed the title WIP: Refactor: ingester changes Refactor: ingester changes Oct 16, 2025
@MarceloRobert MarceloRobert marked this pull request as ready for review October 16, 2025 20:32
"""Write debug/perf output to stdout. Logger was unreliable in some environments."""
try:
print(
f"[{time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())}] {msg}",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested double quotes on f-string

if os.path.isfile(os.path.join(spool_dir, f)) and f.endswith(".json")
]
if not json_files:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to return like this instead of tuple? I think we should return empty list and 0 files
return [], 0 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this simple return should be for the ingest_submissions_parallel, so I returned the [], 0 here and added another condition there

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed with the rebase, not needed anymore

)
)
json_files, total_bytes = get_spool_files(spool_dir)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not json_files: maybe? i there is no spool files, we might need to put small delay and check again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had moved the if not json_files to the other function but it should be here. I'll fix it

"""
Consumes the list of objects and tries to insert them into the database.
"""
total = len(issues_buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think should be something like:

total = (
    len(issues_buf)
    + len(checkouts_buf)
    + len(builds_buf)
    + len(tests_buf)
    + len(incidents_buf)
)

from kernelCI_app.models import Builds, Checkouts, Incidents, Issues, Tests

type TableNames = Literal["issues", "checkouts", "builds", "tests", "incidents"]
type TableModels = Issues | Checkouts | Builds | Tests | Incidents
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure in this part...
TableModels = Issues | Checkouts | Builds | Tests | Incidents denotes instances, but MODEL_MAP stores classes and consume_buffer expects a class

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it denote instances? IIUC it denotes the classes themselves, hovering over the TableModels variable I see (type) TableModels = type[Issues] | type[Checkouts]... directly

INGEST_FLUSH_TIMEOUT_SEC = float(os.environ.get("INGEST_FLUSH_TIMEOUT_SEC", 2.0))
except (ValueError, TypeError):
logger.warning("Invalid INGEST_FLUSH_TIMEOUT_SEC, using default 2.0")
INGEST_FLUSH_TIMEOUT_SEC = 5.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If default is 2.0, we must set 2.0 here too

from typing import Literal
from kernelCI_app.models import Builds, Checkouts, Incidents, Issues, Tests

type TableNames = Literal["issues", "checkouts", "builds", "tests", "incidents"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This require Python >=3.12 (PEP 695). Do we use 3.12 and above?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do, it's set on pypoetry.toml and we have been using this format for a while

@MarceloRobert MarceloRobert force-pushed the refactor/ingester branch 2 times, most recently from acc8bd1 to 40ba06b Compare October 29, 2025 19:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor dashboard ingester

3 participants